AB MOTIVATION: The enormous amount of protein sequence data uncovered by 
genome research has increased the demand for computer software that 
can automate the recognition of new proteins. We discuss the relative 
merits of various automated methods for recognizing G-Protein Coupled 
Receptors (GPCRs), a superfamily of cell membrane proteins. GPCRs are 
found in a wide range of organisms and are central to a cellular 
signalling network that regulates many basic physiological processes. 
They are the focus of a significant amount of current pharmaceutical 
research because they play a key role in many diseases. However, their 
tertiary structures remain largely unsolved. The methods described in 
this paper use only primary sequence information to make their 
predictions. We compare a simple nearest neighbor approach (BLAST), 
methods based on multiple alignments generated by a statistical profile 
Hidden Markov Model (HMM), and methods, including Support Vector 
Machines (SVMs), that transform protein sequences into fixed-length 
feature vectors. RESULTS: The last is the most computationally 
expensive method, but our experiments show that, for those interested in 
annotation-quality classification, the results are worth the effort. In 
two-fold cross-validation experiments testing recognition of GPCR 
subfamilies that bind a specific ligand (such as a histamine molecule), 
the errors per sequence at the Minimum Error Point (MEP) were 13.7% for 
multi-class SVMs, 17.1% for our SVMtree method of hierarchical multi-class 
SVM classification, 25.5% for BLAST, 30% for profile HMMs, and 49% for 
Kernel Nearest Neighbor (kernNN), a nearest-neighbor classifier applied to 
the feature vectors. The percentage of true positives recognized before the 
first false positive was 65% for both SVM methods, 13% for BLAST, 5% for 
profile HMMs and 4% for kernNN. 
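
The two evaluation measures quoted above can be illustrated with a minimal sketch. It is not the authors' code. It assumes that the MEP is the score threshold at which false positives plus false negatives are fewest, and that "errors per sequence" is that error count divided by the number of scored sequences; both definitions, along with the function names and the toy scores, are assumptions made for illustration only.

```python
from typing import List, Tuple

Scored = List[Tuple[float, bool]]  # (classifier score, is a true family member)


def _rank(scored: Scored) -> Scored:
    """Sort hits from highest to lowest score."""
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def minimum_error_point(scored: Scored) -> Tuple[int, float]:
    """Return (fewest errors, errors per sequence) over all score thresholds.

    Assumed definition: errors = false positives above the threshold
    plus false negatives below it.
    """
    if not scored:
        raise ValueError("need at least one scored sequence")
    ranked = _rank(scored)
    n = len(ranked)
    total_pos = sum(1 for _, is_pos in ranked if is_pos)
    tp = fp = 0
    best = total_pos  # threshold above every score: all positives are missed
    for k, (score, is_pos) in enumerate(ranked, start=1):
        if is_pos:
            tp += 1
        else:
            fp += 1
        # A threshold cannot separate two hits with identical scores.
        if k < n and ranked[k][0] == score:
            continue
        best = min(best, fp + (total_pos - tp))
    return best, best / n


def true_positives_before_first_fp(scored: Scored) -> float:
    """Fraction of all true members ranked above the first false positive."""
    ranked = _rank(scored)
    total_pos = sum(1 for _, is_pos in ranked if is_pos)
    hits = 0
    for _, is_pos in ranked:
        if not is_pos:
            break
        hits += 1
    return hits / total_pos if total_pos else 0.0


# Toy data, invented purely for illustration.
toy = [(0.9, True), (0.8, True), (0.7, False),
       (0.6, True), (0.2, False), (0.1, False)]
errors, rate = minimum_error_point(toy)
coverage = true_positives_before_first_fp(toy)
```

Both functions take the same ranked list, so one classifier's scores for a subfamily can be summarized by either measure without rescoring.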


